Abstract. Previous studies on multi-instance learning typically treated
instances in the bags as independently and identically distributed. The
instances in a bag, however, are rarely independent in real tasks, and
better performance can be expected if the instances are treated in
a non-i.i.d. way that exploits the relations among instances. In this paper,
we propose two simple yet effective methods. In the first method, we
explicitly map every bag to an undirected graph and design a graph
kernel for distinguishing the positive and negative bags. In the second
method, we implicitly construct graphs by deriving affinity matrices and
propose an efficient graph kernel considering the clique information. The
effectiveness of the proposed methods is validated by experiments.

1 Introduction

In multi-instance learning [11], each training example is a bag of instances. A
bag is positive if it contains at least one positive instance, and negative
otherwise. Although the labels of the training bags are known, the labels
of the instances in the bags are unknown. The goal is to construct a learner to
classify unseen bags. Multi-instance learning has been found useful in diverse
domains such as image categorization [6,7], image retrieval [35], text
categorization [2,24], computer security [22], face detection [27,32],
computer-aided medical diagnosis [12], etc.

A prominent advantage of multi-instance learning mainly lies in the fact that
many real objects have inherent structures, and by adopting the multi-instance
representation we are able to represent such objects more naturally and capture
more information than simply using the flat single-instance representation. For
example, suppose we can partition an image into several parts. In contrast to
representing the whole image as a single instance, if we represent each part as
an instance, then the partition information is captured by the multi-instance
representation; and if the partition is meaningful (e.g., each part corresponds to
a region of saliency), the additional information captured by the multi-instance
representation may help to make the learning task easier to deal with.

It is obviously not a good idea to apply multi-instance learning techniques
everywhere, since if the single-instance representation is sufficient, using the
multi-instance representation just gilds the lily. Even on tasks where the objects
have inherent structures, we should keep in mind that the power of the
multi-instance representation lies in its ability to capture some structure
information. However, as Zhou and Xu [36] indicated, previous studies on
multi-instance learning typically treated the instances in the bags as
independently and identically distributed; this neglects the fact that the
relations among the instances convey important structure information.
Considering the above image task again, treating the different image parts as
inter-correlated samples is evidently more meaningful than treating them as
unrelated samples. Actually, the instances in a bag are rarely independent, and
better performance can be expected if the instances are treated in a non-i.i.d.
way that exploits the relations among instances.

In this paper, we propose two multi-instance learning methods which do not 
treat the instances as i.i.d. samples. Our basic idea is to regard each bag as 
an entity to be processed as a whole, and regard instances as inter-correlated 
components of the entity. Experiments show that our proposed methods achieve 
performances highly competitive with state-of-the-art multi-instance learning 
methods. 

The rest of this paper is organized as follows. We briefly review related work
in Section 2, propose the new methods in Section 3, report on our experiments
in Section 4, and conclude the paper in Section 5.

2 Related Work 

Many multi-instance learning methods have been developed during the past
decade. To name a few: Diverse Density [16], the k-nearest neighbor algorithm
Citation-kNN [29], the decision trees RELIC [22] and MITI [4], the neural networks
BP-MIP and RBF-MIP [33], the rule learning algorithm RIPPER-MI [9], the ensemble
algorithms MIBoosting [31] and MILBoosting [3], the logistic regression algorithm
MI-LR [20], etc.

Kernel methods for multi-instance learning have been studied by many
researchers. Gärtner et al. [14] defined the Mi-Kernel by regarding each bag as
a set of feature vectors and then applying a set kernel directly. Andrews et al.
[2] proposed mi-SVM and MI-SVM. mi-SVM tries to identify a maximal margin
hyperplane for the instances subject to the constraints that at least
one instance of each positive bag lies in the positive half-space while all
instances of negative bags lie in the negative half-space; MI-SVM tries to
identify a maximal margin hyperplane for the bags by regarding the margin of the
"most positive instance" in a bag as the margin of that bag. Cheung and Kwok
[8] argued that the sign, instead of the value, of the margin of the most positive
instance was important. They defined a loss function which allowed bags as well
as instances to participate in the optimization process, and used the well-formed
constrained concave-convex procedure to perform the optimization. Later, Kwok
and Cheung [15] designed marginalized multi-instance kernels by incorporating
a generative model into the kernel design. Chen and Wang [7] proposed the
DD-SVM method, which employs Diverse Density [16] to learn a set of instance
prototypes and then maps the bags to a feature space based on the instance
prototypes. Zhou and Xu [36] proposed the MissSVM method by regarding instances
of negative bags as labeled examples and those of positive bags as unlabeled
examples with positive constraints. Wang et al. [28] proposed the PPMM kernel
by representing each bag as some aggregate posteriors of a mixture model
derived based on unlabeled data.

In addition to classification, multi-instance regression has also been studied
[1,21], and different versions of generalized multi-instance learning have been
defined [30,23]. The main difference between standard multi-instance learning
and generalized multi-instance learning is that in standard multi-instance
learning there is a single concept, and a bag is positive if it has an instance
that satisfies this concept; while in generalized multi-instance learning [30,23]
there are multiple concepts, and a bag is positive only when all concepts are
satisfied (i.e., the bag contains instances from every concept). Recently,
research on multi-instance semi-supervised learning [19], multi-instance active
learning [24] and multi-instance multi-label learning [37] has also been reported.
In this paper we mainly work on standard multi-instance learning [11] and will
show that our methods are also applicable to multi-instance regression. Actually
it is also possible to extend our proposal to other variants of multi-instance
learning.

Zhou and Xu [36] indicated that instances in a bag should not be treated 
as i.i.d. samples, and this paper provides a solution. Our basic idea is to regard 
every bag as an entity to be processed as a whole. There are alternative ways to 
realize the idea, while in this paper we work by regarding each bag as a graph. 
McGovern and Jensen [17] have taken multi-instance learning as a tool to handle 
relational data where each instance is given as a graph. Here, we are working on 
propositional data and there is no natural graph. In contrast to having instances 
as graphs, we regard every bag as a graph and each instance as a node in the 
graph. 

3 The Proposed Methods 

In this section we propose the MIGraph and miGraph methods. The MIGraph 
method explicitly maps every bag to an undirected graph and uses a new graph 
kernel to distinguish the positive and negative bags. The miGraph method im- 
plicitly constructs graphs by deriving affinity matrices and defines an efficient 
graph kernel considering the clique information. 

Before presenting the details, we give the formal definition of multi-instance
learning as follows. Let $\mathcal{X}$ denote the instance space. Given a data set
$\{(X_1, y_1), \ldots, (X_i, y_i), \ldots, (X_N, y_N)\}$, where
$X_i = \{x_{i1}, \ldots, x_{ij}, \ldots, x_{i,n_i}\} \subseteq \mathcal{X}$ is called
a bag and $y_i \in \mathcal{Y} = \{-1, +1\}$ is the label of $X_i$, the goal is to
generate a learner to classify unseen bags. Here $x_{ij} \in \mathcal{X}$ is an
instance $[x_{ij1}, \ldots, x_{ijl}, \ldots, x_{ijd}]$, $x_{ijl}$ is the value of
$x_{ij}$ at the $l$th attribute, $N$ is the number of training bags, $n_i$ is
the number of instances in $X_i$, and $d$ is the number of attributes. If there
exists $g \in \{1, \ldots, n_i\}$ such that $x_{ig}$ is a positive instance, then
$X_i$ is a positive bag and thus $y_i = +1$; otherwise $y_i = -1$. Yet the concrete
value of the index $g$ is unknown.
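As a small illustration of this definition, the following Python sketch stores a bag as a list of feature vectors and derives the bag label from instance labels. The toy data and the names `Bag` and `bag_label` are hypothetical; in a real task the instance labels are hidden and only the bag labels are observed.

```python
from typing import List, Sequence

Instance = Sequence[float]   # one feature vector x_ij with d attributes
Bag = List[Instance]         # one bag X_i with n_i instances


def bag_label(instance_labels: Sequence[int]) -> int:
    """Return +1 if at least one instance is positive, otherwise -1."""
    return +1 if any(label == +1 for label in instance_labels) else -1


# Hypothetical toy data: two bags of 2-D instances.
bags: List[Bag] = [
    [[0.1, 0.2], [0.9, 0.8], [0.4, 0.4]],
    [[0.2, 0.1], [0.3, 0.3]],
]
# Instance labels are unknown in practice; they are listed here only to
# show how the bag label follows from them.
hidden_instance_labels = [[-1, +1, -1], [-1, -1]]
bag_labels = [bag_label(labels) for labels in hidden_instance_labels]
```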



Fig. 1. Example images with six marked patches each corresponding to an instance 




Fig. 2. If we do not consider the relations among the instances, the three bags are 
similar to each other since they have identical number of very similar instances 




Fig. 3. If we consider the relations among the instances, the first two bags are more 
similar than the third bag. Here, the solid lines highlight the high affinity among similar 
instances 



We first explain the intuition behind the proposed methods. Here, we use the
three example images shown in Fig. 1 for illustration. For simplicity, we show
six marked patches in each figure, and assume that each image corresponds to a
bag, each patch corresponds to an instance in the bag, and the marked patches
with the same color are very similar (real cases are of course more complicated,
but the essentials are similar to the illustration). If the instances were
treated as independent samples, then Fig. 1 could be abstracted as Fig. 2, which
is the typical way taken by previous multi-instance learning studies, and
obviously the three bags are similar to each other since they contain identical
numbers of very similar instances. However, if we consider the relations among
the instances, we can find that in the first two bags the blue marks are very
close to each other while in the third bag the blue marks scatter among orange
marks, and thus the first two bags should be more similar to each other than to
the third bag. In this case, Fig. 1 can be abstracted by Fig. 3. It is evident
that the abstraction in Fig. 3 is more desirable than that in Fig. 2. The
essential point is that the relation structures of bags belonging to the same
class are relatively more similar, while those belonging to different classes
are relatively more dissimilar.

Now we describe the MIGraph method. The first step is to construct a graph
for each bag. Inspired by [26], which shows that the $\epsilon$-graph is helpful
for discovering the underlying manifold structure of data, here we establish an
$\epsilon$-graph for every bag. The process is quite straightforward. For a bag
$X_i$, we regard every instance of it as a node. Then, we compute the distance of
every pair of nodes, e.g., $x_{iu}$ and $x_{iv}$. If the distance between $x_{iu}$
and $x_{iv}$ is smaller than a pre-set threshold $\epsilon$, then an edge is
established between these two nodes, where the weight of the edge expresses the
affinity of the two nodes (in experiments we use the normalized reciprocal of
non-zero distance as the affinity value). Many distance measures can be used to
compute the distances. According to the manifold property [26], i.e., a small
local area is approximately a Euclidean space, we use Euclidean distance to
establish the $\epsilon$-graph. When categorical attributes are involved, we use
VDM (Value Difference Metric) [25] as a complement. In detail, suppose the first
$j$ attributes are categorical while the remaining $(d - j)$ ones are continuous
attributes normalized to [0,1]. We can use
$\left(\sum_{h=1}^{j} VDM(x_{1,h}, x_{2,h}) + \sum_{h=j+1}^{d} |x_{1,h} - x_{2,h}|^2\right)^{1/2}$
to measure the distance between $x_1$ and $x_2$. Here the VDM distance between two
values $z_1$ and $z_2$ on categorical attribute $Z$ can be computed by



$$VDM(z_1, z_2) = \sum_{c=1}^{C} \left| \frac{N_{Z,z_1,c}}{N_{Z,z_1}} - \frac{N_{Z,z_2,c}}{N_{Z,z_2}} \right|^2 \qquad (1)$$

where $N_{Z,z}$ denotes the number of training examples holding value $z$ on $Z$,
$N_{Z,z,c}$ denotes the number of training examples belonging to class $c$ and
holding value $z$ on $Z$, and $C$ denotes the number of classes.
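The graph construction step can be sketched in Python as follows, for continuous attributes only (VDM is not covered). The text does not say how the reciprocal distances are normalized or how zero-distance pairs are handled, so two assumptions are made here and marked in the comments: affinities are divided by the largest reciprocal in the graph, and zero-distance pairs receive the maximum affinity 1.0. The function names are hypothetical.

```python
import math
from typing import Dict, Sequence, Tuple

Edge = Tuple[int, int]


def euclidean(x: Sequence[float], y: Sequence[float]) -> float:
    """Euclidean distance between two instances."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def epsilon_graph(bag: Sequence[Sequence[float]], eps: float) -> Dict[Edge, float]:
    """Map a bag to an undirected weighted graph.

    Nodes are instance indices. An edge (u, v) with u < v is created when the
    distance between the two instances is smaller than eps.
    """
    close: Dict[Edge, float] = {}
    for u in range(len(bag)):
        for v in range(u + 1, len(bag)):
            dist = euclidean(bag[u], bag[v])
            if dist < eps:
                close[(u, v)] = dist
    # Assumption: normalize reciprocal distances by the largest reciprocal,
    # so that every affinity lies in (0, 1].
    reciprocals = [1.0 / d for d in close.values() if d > 0.0]
    scale = max(reciprocals, default=1.0)
    # Assumption: identical instances (distance 0) get the maximum affinity.
    return {
        edge: (1.0 / d) / scale if d > 0.0 else 1.0
        for edge, d in close.items()
    }
```

The double loop visits every pair of instances once, so building the graph for a bag with $n_i$ instances takes $O(n_i^2)$ distance computations.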

After mapping the training bags to a set of graphs, we can have many options
to build a classifier. For example, we can build a k-nearest neighbor classifier
that employs graph edit distance [18], or we can design a graph kernel [13] to
capture the similarity among graphs and then solve classification problems by
kernel machines such as SVM. The MIGraph method takes the second way, and
the idea of our graph kernel is illustrated in Fig. 4.

Briefly, to measure the similarity between the two bags shown in the left
part of Fig. 4, we use a node kernel (i.e., $k_{node}$) to incorporate the
information conveyed by the nodes, use an edge kernel (i.e., $k_{edge}$) to
incorporate the information conveyed by the edges, and aggregate them to obtain
the final graph kernel (i.e., $k_G$). Formally, we define $k_G$ as follows.

Definition 1. Given two multi-instance bags $X_i$ and $X_j$ which are presented as
graphs $G_h(\{x_{hu}\}_{u=1}^{n_h}, \{e_{hv}\}_{v=1}^{m_h})$, $h = i, j$, where $n_h$
and $m_h$ are the number of nodes and edges in $G_h$, respectively,

$$k_G(X_i, X_j) = \sum_{a=1}^{n_i} \sum_{b=1}^{n_j} k_{node}(x_{ia}, x_{jb}) + \sum_{a=1}^{m_i} \sum_{b=1}^{m_j} k_{edge}(e_{ia}, e_{jb}), \qquad (2)$$

where $k_{node}$ and $k_{edge}$ are positive semidefinite kernels. To avoid
numerical problems, $k_G$ is normalized to

$$k_G(X_i, X_j) = \frac{k_G(X_i, X_j)}{\sqrt{k_G(X_i, X_i)} \sqrt{k_G(X_j, X_j)}}. \qquad (3)$$



Fig. 4. Illustration of the graph kernel in MIGraph (node kernel + edge kernel)

The $k_{node}$ and $k_{edge}$ can be defined in many ways. Here we simply define
$k_{node}$ using the Gaussian RBF kernel as

$$k_{node}(x_{ia}, x_{jb}) = \exp(-\gamma \|x_{ia} - x_{jb}\|^2), \qquad (4)$$

and so the first part of Eq. 2 is exactly the Mi-Kernel using the Gaussian RBF
kernel [14]. $k_{edge}$ is defined in a form similar to Eq. 4, by replacing
$x_{ia}$ and $x_{jb}$ with $e_{ia}$ and $e_{jb}$, respectively.

Here a key issue is how to define the feature vector describing an edge. In this
paper, for the edge connecting the nodes $x_{iu}$ and $x_{iv}$ of the bag $X_i$, we
define it as $[d_u, p_u, d_v, p_v]'$, where $d_u$ is the degree of the node
$x_{iu}$, that is, the number of edges connecting $x_{iu}$ with other nodes. Note
that it has been normalized through dividing it by the total number of edges in
the graph corresponding to $X_i$. $d_v$ is the degree of the node $x_{iv}$, which
is defined similarly. $p_u$ is defined as $p_u = w_{uv} / \sum w_{u,*}$, where the
numerator is the weight of the edge connecting $x_{iu}$ to $x_{iv}$; $w_{u,*}$ is
the weight of the edge connecting $x_{iu}$ to any node in $X_i$, thus the
denominator is the sum of all the weights connecting with $x_{iu}$. It is evident
that $p_u$ conveys information on how important (or unimportant) the connection
with the node $x_{iv}$ is for the node $x_{iu}$. $p_v$ is defined similarly for the
node $x_{iv}$. The intuition here is that edges are similar if properties of
their ending nodes (e.g., high-degree nodes or low-degree nodes) are similar.
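A minimal Python sketch of the normalized graph kernel follows, assuming each graph is given as a list of instances plus a dictionary mapping an edge `(u, v)` to its weight. The function names and the separate `gamma_node` and `gamma_edge` parameters are illustrative assumptions, as is the choice to order each edge's feature vector by the stored `(u, v)` orientation.

```python
import math
from typing import Dict, List, Sequence, Tuple

Edge = Tuple[int, int]


def rbf(x: Sequence[float], y: Sequence[float], gamma: float) -> float:
    """Gaussian RBF kernel exp(-gamma * ||x - y||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def edge_features(edges: Dict[Edge, float]) -> List[List[float]]:
    """Describe every edge (u, v) by the vector [d_u, p_u, d_v, p_v]."""
    degree: Dict[int, int] = {}
    strength: Dict[int, float] = {}
    for (u, v), w in edges.items():
        for node in (u, v):
            degree[node] = degree.get(node, 0) + 1
            strength[node] = strength.get(node, 0.0) + w
    m = len(edges)  # degrees are normalized by the total number of edges
    return [
        [degree[u] / m, w / strength[u], degree[v] / m, w / strength[v]]
        for (u, v), w in edges.items()
    ]


def k_graph_raw(bag_i, edges_i, bag_j, edges_j, gamma_node, gamma_edge):
    """Unnormalized kernel: node part plus edge part."""
    node_part = sum(rbf(x, y, gamma_node) for x in bag_i for y in bag_j)
    feats_i, feats_j = edge_features(edges_i), edge_features(edges_j)
    edge_part = sum(rbf(e, f, gamma_edge) for e in feats_i for f in feats_j)
    return node_part + edge_part


def k_graph(bag_i, edges_i, bag_j, edges_j, gamma_node=1.0, gamma_edge=1.0):
    """Kernel value divided by the square roots of the two self-similarities."""
    kij = k_graph_raw(bag_i, edges_i, bag_j, edges_j, gamma_node, gamma_edge)
    kii = k_graph_raw(bag_i, edges_i, bag_i, edges_i, gamma_node, gamma_edge)
    kjj = k_graph_raw(bag_j, edges_j, bag_j, edges_j, gamma_node, gamma_edge)
    return kij / math.sqrt(kii * kjj)
```

The two nested sums make the cost proportional to the number of node pairs plus the number of edge pairs, which is why bags with many edges are expensive for this kernel.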

The $k_G$ defined in Eq. 2 is a positive definite kernel and it can be used for
any kind of graph. The computational complexity of $k_G(X_i, X_j)$ is
$O(n_i n_j + m_i m_j)$. The $k_G$ clearly satisfies all four major properties
that should be considered for a graph kernel definition [5].¹ Our above design is
very simple, but in the next section we can see that the proposed MIGraph method
is quite effective.

A deficiency of MIGraph is that the computational complexity of $k_G$ is
$O(n_i n_j + m_i m_j)$, dominated by the number of edges. For bags containing a
lot of instances, there will exist a large number of edges and MIGraph will be
hard to execute. So, it is desirable to have a method with smaller computational
cost. For this purpose, we propose the miGraph method, which is simple and
efficient yet effective.

¹ We have tried to apply some existing graph kernels directly but unfortunately the
results were not good. Due to the page limit, comparison with different graph kernels
will be reported in a longer version.

For bag $X_i$, we can calculate the distance between its instances and derive an
affinity matrix $W^i$ by comparing the distances with a threshold $\delta$. For
example, if the distance between the instances $x_{ia}$ and $x_{iu}$ is smaller
than $\delta$, $W^i$'s element at the $a$th row and $u$th column, $w^i_{au}$, is
set to 1, and 0 otherwise. There are many ways to derive $W^i$ for $X_i$. In this
paper we calculate the distances using Gaussian distance, and set $\delta$ to the
average distance in the bag. The key of miGraph, the kernel $k_g$, is defined as
follows.
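The derivation of the affinity matrix can be sketched in Python as below. The text does not spell out the Gaussian distance, so this sketch assumes $1 - \exp(-\gamma \|x - y\|^2)$; it also assumes that the average is taken over pairs of distinct instances and that the diagonal is always 1. These choices and the function names are illustrative, not taken from the paper.

```python
import math
from typing import List, Sequence


def gaussian_distance(x: Sequence[float], y: Sequence[float], gamma: float) -> float:
    """Assumed form of the Gaussian distance: 1 - exp(-gamma * ||x - y||^2)."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return 1.0 - math.exp(-gamma * sq)


def affinity_matrix(bag: Sequence[Sequence[float]], gamma: float = 1.0) -> List[List[int]]:
    """Return the 0/1 affinity matrix of a bag.

    Entry (a, u) is 1 when the distance between instances a and u is smaller
    than delta, the average pairwise distance in the bag, and 0 otherwise.
    """
    n = len(bag)
    dist = [[gaussian_distance(bag[a], bag[u], gamma) for u in range(n)]
            for a in range(n)]
    # Assumption: the average is taken over pairs of distinct instances.
    pairs = [dist[a][u] for a in range(n) for u in range(n) if a != u]
    delta = sum(pairs) / len(pairs) if pairs else 0.0
    # Assumption: every instance is related to itself (diagonal set to 1).
    return [[1 if a == u or dist[a][u] < delta else 0 for u in range(n)]
            for a in range(n)]
```

For a bag with two nearby instances and one distant instance, the two nearby instances end up in one block of the matrix and the distant one forms its own block.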

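Definition 2. Given two multi-instance bags $X_i$ and $X_j$ which contain $n_i$ and
$n_j$ instances, respectively,

$$k_g(X_i, X_j) = \frac{\sum_{a=1}^{n_i} \sum_{b=1}^{n_j} W_{ia} W_{jb} k(x_{ia}, x_{jb})}{\sum_{a=1}^{n_i} W_{ia} \sum_{b=1}^{n_j} W_{jb}}, \qquad (5)$$

where $W_{ia} = 1 / \sum_{u=1}^{n_i} w^i_{au}$, $W_{jb} = 1 / \sum_{v=1}^{n_j} w^j_{bv}$,
and $k(x_{ia}, x_{jb})$ is defined similarly to Eq. 4.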

To understand the intuition of $k_g$, it is helpful to consider that once we have
got a good graph, instances in one clique can be regarded as belonging to one
concept. Finding the cliques is generally expensive for large graphs, while $k_g$
can be viewed as an efficient soft version of a clique-based graph kernel, where
the following principles are evidently satisfied:

- 1) When $W^i = I$, i.e., no two instances belong to the same
concept, all instances in a bag should be treated equally, i.e., $W_{ia} = 1$ for
every instance $x_{ia}$;

- 2) When $W^i = E$ ($E$ is the all-one matrix), i.e., all instances belong to the
same concept, each bag can be viewed as one instance and each instance contributes
identically, i.e., $W_{ia} = 1/n_i$;

- 3) When $W^i$ is a block matrix, i.e., instances are clustered into cliques
each belonging to a concept, $W_{ia} = 1/n_{ia}$ where $n_{ia}$ is the size of the
clique to which $x_{ia}$ belongs. In this case, $k_g$ is exactly a clique-based
graph kernel;

- 4) When the value of any entry of $W^i$ increases, for example $w^i_{ab}$,
$W_{ia}$ and $W_{ib}$ should decrease since they become more similar, while other
$W_{iq}$ $(q \neq a, b)$ should not be affected.

It is evident that the computational complexity of $k_g$ is similar to that
of the multi-instance kernel shown in Eq. 4, i.e., $O(n_i n_j)$. Note that once
the multi-instance kernel is obtained, the Gaussian distances between every pair
of instances have already been calculated, and it is easy to get the $W^i$'s.
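The kernel $k_g$ can be sketched in Python as follows, under the assumption that it is the weighted average of instance-level RBF kernel values, with each instance weighted by the reciprocal of its row sum in the affinity matrix. That form is consistent with the principles listed above, but it should be read as an assumption of this sketch, and the function names are hypothetical.

```python
import math
from typing import List, Sequence


def rbf(x: Sequence[float], y: Sequence[float], gamma: float) -> float:
    """Gaussian RBF kernel exp(-gamma * ||x - y||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def instance_weights(W: Sequence[Sequence[float]]) -> List[float]:
    """Weight of each instance: reciprocal of its row sum in the affinity matrix."""
    return [1.0 / sum(row) for row in W]


def k_migraph(bag_i, W_i, bag_j, W_j, gamma: float = 1.0) -> float:
    """Weighted average of instance-level kernel values between two bags."""
    wi, wj = instance_weights(W_i), instance_weights(W_j)
    numerator = sum(
        wi[a] * wj[b] * rbf(bag_i[a], bag_j[b], gamma)
        for a in range(len(bag_i))
        for b in range(len(bag_j))
    )
    return numerator / (sum(wi) * sum(wj))
```

With an identity affinity matrix every weight is 1 and the value reduces to the plain average of the instance-level kernel values; with an all-one matrix every instance of a bag of size $n_i$ gets weight $1/n_i$; with a block matrix each instance gets the reciprocal of its clique size. The double loop over instance pairs gives the $O(n_i n_j)$ cost.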

4 Experiments 

4.1 Benchmark Tasks 

First, we evaluate the proposed MIGraph and miGraph methods on five benchmark
data sets popularly used in studies of multi-instance learning, including
Musk1, Musk2, Elephant, Fox and Tiger. Musk1 contains 47 positive and 45 negative
bags, Musk2 contains 39 positive and 63 negative bags, and each of the other
three data sets contains 100 positive and 100 negative bags. More details of the
data sets can be found in [11,2].

Table 1. Accuracy (%) on benchmark tasks

Algorithm    Musk1         Musk2         Elept         Fox           Tiger
MIGraph      90.0 ± 3.8    90.0 ± 2.7    85.1 ± 2.8    61.2 ± 1.7    81.9 ± 1.5
miGraph      88.9 ± 3.3    90.3 ± 2.6    86.8 ± 0.7    61.6 ± 2.8    86.0 ± 1.6
Mi-Kernel    88.0 ± 3.1    89.3 ± 1.5    84.3 ± 1.6    60.3 ± 1.9    84.2 ± 1.0
MI-SVM       77.9          84.3          81.4          59.4          84.0
mi-SVM       87.4          83.6          82.0          58.2          78.9
MissSVM      87.6          80.0          N/A           N/A           N/A
PPMM         95.6          81.2          82.4          60.3          82.4
DD           88.0          84.0          N/A           N/A           N/A
EM-DD        84.8          84.9          78.3          56.1          72.1

We compare MIGraph and miGraph with Mi-Kernel [14] via ten times 10-fold
cross validation (i.e., we repeat 10-fold cross validation ten times with
different random data partitions). All these methods use the Gaussian RBF kernel
and the parameters are determined through cross validation on training sets. The
average test accuracy and standard deviations are shown in Table 1.² The table
also shows the performance of several other multi-instance kernel methods,
including MI-SVM and mi-SVM [2], MissSVM [36] and PPMM kernel [28], and the
famous Diverse Density algorithm [16] and its improvement EM-DD [34]. The
results of all methods except Diverse Density were obtained via ten times 10-fold
cross validation; they were the best results reported in the literature. Since
they were obtained in different studies and the standard deviations were not
available, these results are only for reference instead of a rigorous comparison.
The best performance on each data set is bolded.
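The "ten times 10-fold" protocol can be sketched as an index generator in Python. This is only an illustration of the splitting scheme: the generator name and the seed are hypothetical, and whether the folds were stratified by class is not stated in the text, so no stratification is assumed here.

```python
import random
from typing import Iterator, List, Tuple


def repeated_kfold(
    n_bags: int, n_folds: int = 10, n_repeats: int = 10, seed: int = 0
) -> Iterator[Tuple[int, int, List[int], List[int]]]:
    """Yield (repeat, fold, train_indices, test_indices) over bag indices.

    Each repeat reshuffles the bags, so the ten repetitions use different
    random data partitions.
    """
    rng = random.Random(seed)
    for rep in range(n_repeats):
        order = list(range(n_bags))
        rng.shuffle(order)
        # Deal the shuffled indices into n_folds nearly equal folds.
        folds = [order[f::n_folds] for f in range(n_folds)]
        for f, test_idx in enumerate(folds):
            train_idx = [i for g, fold in enumerate(folds) if g != f for i in fold]
            yield rep, f, train_idx, test_idx


# Musk1 has 47 + 45 = 92 bags, which gives 10 x 10 = 100 train/test splits.
splits = list(repeated_kfold(92))
```

Splitting is done over bags rather than instances, so all instances of a bag stay on the same side of every split.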

Table 1 shows that the performance of MIGraph and miGraph is quite
good. On Musk1 they are only worse than PPMM kernel; note that the results
of PPMM kernel were obtained through an exhaustive search that may be
prohibitive in practice [28]. On Musk2, Elephant and Fox, miGraph and MIGraph
are respectively the best and second-best algorithms. Pairwise t-tests at 95%
significance level indicate that miGraph is significantly better than Mi-Kernel
on all data sets except Musk2, where there is no significant difference.
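A paired t-test of this kind can be sketched in Python as below. The sketch assumes the test is paired over the ten repetitions of 10-fold cross validation, which gives nine degrees of freedom; the text does not state the pairing unit. The accuracy values are made up for illustration and are not results from this paper.

```python
import math
from typing import Sequence

# Two-sided critical value of Student's t for alpha = 0.05 and 9 degrees of freedom.
T_CRIT_DF9 = 2.262


def paired_t_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """t statistic of the paired differences a[k] - b[k]."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally long samples with at least 2 values")
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    if var == 0.0:
        raise ValueError("all differences are identical; t is undefined")
    return mean / math.sqrt(var / n)


# Hypothetical per-repetition accuracies (%) of two methods on one data set.
acc_method_a = [86.1, 87.0, 86.5, 85.8, 86.9, 87.2, 86.3, 86.0, 86.7, 86.4]
acc_method_b = [84.0, 84.9, 83.7, 84.4, 84.1, 85.0, 83.9, 84.6, 84.2, 84.3]
t_value = paired_t_statistic(acc_method_a, acc_method_b)
significant = abs(t_value) > T_CRIT_DF9
```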



² We have re-implemented Mi-Kernel since the comparison with Mi-Kernel will clearly
show whether it is helpful to treat instances as non-i.i.d. samples (this is the only 
difference between our methods and Mi-Kernel). Note that the performance of Mi- 
Kernel in our implementation is better than that reported in [14]. 



Table 2. Accuracy (%) on image categorization

Algorithm     1000-Image             2000-Image
MIGraph       83.9 : [81.2, 85.7]    72.1 : [71.0, 73.2]
miGraph       82.4 : [80.2, 82.6]    70.5 : [68.7, 72.3]
Mi-Kernel     81.8 : [80.1, 83.6]    72.0 : [71.2, 72.8]
MI-SVM        74.7 : [74.1, 75.3]    54.6 : [53.1, 56.1]
DD-SVM        81.5 : [78.5, 84.5]    67.5 : [66.1, 68.9]
MissSVM       78.0 : [75.8, 80.2]    65.2 : [62.0, 68.3]
kmeans-SVM    69.8 : [67.9, 71.7]    52.3 : [51.6, 52.9]
MILES         82.6 : [81.4, 83.7]    68.7 : [67.3, 70.1]



4.2 Image Categorization 

Image categorization is one of the most successful applications of multi-instance
learning. The data sets 1000-Image and 2000-Image contain ten and twenty
categories of COREL images, respectively, where each category has 100 images.
Each image is regarded as a bag, and the ROIs (Regions of Interest) in the image
are regarded as instances described by nine features. More details of these data
sets can be found in [7,6].

We use the same experimental routine as that described in [6]. On each data
set, we randomly partition the images within each category in half, and use one
subset for training and the other for testing. The experiment is repeated five
times with five random splits, and the average results are recorded. The
one-against-one strategy is used by MIGraph, miGraph and Mi-Kernel for this
multi-class task. Following the style of [7,6], we present the overall accuracy
as well as 95% confidence intervals in Table 2. For reference, the table also
shows the best results of some other multi-instance learning methods reported in
the literature, including MI-SVM [2,7], DD-SVM [7], kmeans-SVM [10], MissSVM [36]
and MILES [6].

It can be found from Table 2 that on the image categorization task our
proposed MIGraph and miGraph are highly competitive with state-of-the-art
multi-instance learning methods. In particular, MIGraph is the best-performing
method. This confirms our intuition that MIGraph is a good choice when each
bag contains a few instances, and miGraph is better when each bag contains a
lot of instances.

By examining the detailed results on 1000-Image, we found that at least one of
MIGraph and miGraph is better than Mi-Kernel on most categories; the exceptions
are African and Dinosaurs. This might be due to the fact that the
structure information of examples belonging to these complicated concepts³ is
too difficult to be captured by the simple schemes used in MIGraph and
miGraph, while using incorrect structure information is worse than conservatively
treating the instances as i.i.d. samples.

For all three methods the largest errors occur between Beach and Mountains
(the full name of this category is Mountains & glaciers). This phenomenon has
been observed before [7,6,36], owing to the fact that many images of these two
categories contain semantically related and visually similar regions such as those
corresponding to mountain, river, lake and ocean.

³ Dinosaurs is complicated since it contains many different kinds of imaginary animals,
toys and even bones.

Table 3. Accuracy (%) on text categorization

Data set                    Mi-Kernel     miGraph
alt.atheism                 60.2 ± 3.9    65.5 ± 4.0
comp.graphics               47.0 ± 3.3    77.8 ± 1.6
comp.os.ms-windows.misc     51.0 ± 5.2    63.1 ± 1.5
comp.sys.ibm.pc.hardware    46.9 ± 3.6    59.5 ± 2.7
comp.sys.mac.hardware       44.5 ± 3.2    61.7 ± 4.8
comp.windows.x              50.8 ± 4.3    69.8 ± 2.1
misc.forsale                51.8 ± 2.5    55.2 ± 2.7
rec.autos                   52.9 ± 3.3    72.0 ± 3.7
rec.motorcycles             50.6 ± 3.5    64.0 ± 2.8
rec.sport.baseball          51.7 ± 2.8    64.7 ± 3.1
rec.sport.hockey            51.3 ± 3.4    85.0 ± 2.5
sci.crypt                   56.3 ± 3.6    69.6 ± 2.1
sci.electronics             50.6 ± 2.0    87.1 ± 1.7
sci.med                     50.6 ± 1.9    62.1 ± 3.9
sci.space                   54.7 ± 2.5    75.7 ± 3.4
soc.religion.christian      49.2 ± 3.4    59.0 ± 4.7
talk.politics.guns          47.7 ± 3.8    58.5 ± 6.0
talk.politics.mideast       55.9 ± 2.8    73.6 ± 2.6
talk.politics.misc          51.5 ± 3.7    70.4 ± 3.6
talk.religion.misc          55.4 ± 4.3    63.3 ± 3.5

4.3 Text Categorization 

The twenty text categorization data sets were derived from the 20 Newsgroups
corpus popularly used in text categorization. Fifty positive and fifty negative
bags were generated for each of the 20 news categories. In each positive bag,
3% of the posts are randomly drawn from the target category, and the other
instances (and all instances in negative bags) are randomly and uniformly drawn
from other categories. Each instance is a post represented by the top 200 TFIDF
features.

On each data set we run ten times 10-fold cross validation (i.e., we repeat
10-fold cross validation ten times with different random data partitions).
MIGraph does not return results in a reasonable time, and so we only present
the average accuracy with standard deviations of miGraph and Mi-Kernel in
Table 3, where the best result on each data set is bolded.

Pairwise t-tests at 95% significance level indicate that miGraph is
significantly better than Mi-Kernel on all the text categorization data sets.
Impressively, when we examine the detailed results for each of the ten
repetitions of 10-fold cross validation, the win/tie/loss counts of miGraph
versus Mi-Kernel are 10/0/0 on 16 out of the 20 data sets, 9/0/1 on two data
sets (talk.politics.guns and talk.religion.misc), and 7/2/1 on the other two
data sets (alt.atheism and misc.forsale).

Table 4. Squared loss on multi-instance regression tasks

Algorithm    LJ160.1    LJ160.1S    LJ80.1    LJ80.1S
MIGraph      0.0080     0.0112      0.0111    0.0154
miGraph      0.0084     0.0094      0.0118    0.0113
Mi-Kernel    0.0116     0.0127      0.0174    0.0219
DD           0.0852     0.0052      N/A       0.1116
BP-MIP       0.0398     0.0731      0.0487    0.0752
RBF-MIP      0.0108     0.0075      0.0167    0.0448

4.4 Multi-Instance Regression 

We also compare MIGraph, miGraph and Mi-Kernel on four multi-instance
regression data sets, including LJ-160.166.1, LJ-160.166.1-S, LJ-80.166.1 and
LJ-80.166.1-S (abbreviated as LJ160.1, LJ160.1S, LJ80.1 and LJ80.1S, respectively).
In the name LJ-r.f.s, r is the number of relevant features, f is the number of
features, and s is the number of scale factors used for the relevant features that
indicate the importance of the features. The suffix S indicates that the data set
uses only labels that are not near 1/2. More details of these data sets can be
found in [1].
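The naming scheme can be made concrete with a small Python helper that splits a data set name into its parts. The helper and its field names are hypothetical and serve only to illustrate the convention described above.

```python
import re
from typing import NamedTuple


class LJName(NamedTuple):
    relevant_features: int     # r
    features: int              # f
    scale_factors: int         # s
    extreme_labels_only: bool  # True when the name carries the suffix S


_LJ_PATTERN = re.compile(r"^LJ-(\d+)\.(\d+)\.(\d+)(-S)?$")


def parse_lj_name(name: str) -> LJName:
    """Split a name of the form LJ-r.f.s or LJ-r.f.s-S into its components."""
    match = _LJ_PATTERN.match(name)
    if match is None:
        raise ValueError(f"not an LJ data set name: {name!r}")
    r, f, s, suffix = match.groups()
    return LJName(int(r), int(f), int(s), suffix is not None)


info = parse_lj_name("LJ-160.166.1-S")
```

For LJ-160.166.1-S this yields 160 relevant features out of 166, one scale factor, and the flag for labels that are not near 1/2.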

We perform leave-one-out tests and report the results in Table 4. For ref- 
erence, the table also shows the leave-one-out results of some other methods 
reported in literature, including Diverse Density [16, 1], BP-MIP and RBF-MIP 
[33]. In Table 4 the best performance on each data set is bolded. It is evident that 
our proposed miGraph and MIGraph methods also work well on multi-instance 
regression tasks. 

5 Conclusion 

Previous studies on multi-instance learning typically treated instances in the
bags as i.i.d. samples, neglecting the fact that instances within a bag are
extracted from the same object; the instances are therefore rarely i.i.d.
intrinsically, and the relations among instances may convey important
information. In this paper, we propose two methods which treat the instances in
a non-i.i.d. way. Experiments show that our proposed methods are simple yet
effective, with performances highly competitive with the best performing methods
on several multi-instance classification and regression tasks. Note that our
methods can also handle i.i.d. samples by using the identity matrix.

An interesting future issue is to design a better graph kernel to capture more
useful structure information of multi-instance bags. Applying graph edit distance
or metric learning methods to the graphs corresponding to multi-instance bags
is also worth trying. The success of our proposed methods also suggests that it
is possible to improve other multi-instance learning methods by incorporating
mechanisms to exploit the relations among instances, which opens a promising
future direction. Moreover, it is possible to extend our proposal to other
settings such as generalized multi-instance learning, multi-instance
semi-supervised learning, multi-instance active learning, multi-instance
multi-label learning, etc.